Lazy Garbage Collection of Recovery State for Fault-Tolerant Distributed Shared Memory

Authors

  • Florin Sultan
  • Thu D. Nguyen
  • Liviu Iftode
Abstract

In this paper, we address the problem of garbage collection in a single-failure fault-tolerant home-based lazy release consistency (HLRC) distributed shared-memory (DSM) system based on independent checkpointing and logging. Our solution uses laziness in garbage collection and exploits consistency constraints of the HLRC memory model for low overhead and scalability. We prove safe bounds on the state that must be retained in the system to guarantee correct recovery after a failure. We devise two algorithms for garbage collection of checkpoints and logs: checkpoint garbage collection (CGC) and lazy log trimming (LLT). The proposed approach targets large-scale distributed shared-memory computing on local-area clusters of computers. In such systems, using global synchronization or extra communication for garbage collection is inefficient or simply impractical due to system scale and temporary disconnections in communication. The challenge lies in controlling the size of the logs and the number of checkpoints without global synchronization while tolerating transient disruptions in communication. Our garbage collection scheme is completely distributed, does not force processes to synchronize, does not add extra messages to the base DSM protocol, and uses only the available DSM protocol information. Evaluation results for real applications show that it effectively bounds the number of past checkpoints to be retained and the size of the logs in stable storage.
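The abstract does not spell out how LLT works, so the following is only a minimal sketch of the general idea of trimming a log lazily from information piggybacked on ordinary protocol messages. It is not the paper's algorithm. The names `LogEntry`, `Process`, `known_ckpt`, `on_protocol_message`, and `trim_log`, and the rule that an entry becomes garbage once its receiver has checkpointed past the interval in which it used the data, are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    dest: int           # peer that received the logged data (hypothetical field)
    dest_interval: int  # receiver's interval when it used the data (assumed known)
    payload: bytes


@dataclass
class Process:
    pid: int
    log: list = field(default_factory=list)
    # Latest checkpoint interval known for each peer. It is updated only from
    # values carried on regular protocol messages (an assumption of this sketch),
    # so no extra garbage-collection messages are exchanged.
    known_ckpt: dict = field(default_factory=dict)

    def record(self, dest, dest_interval, payload):
        """Append an entry that a failed receiver might need for replay."""
        self.log.append(LogEntry(dest, dest_interval, payload))

    def on_protocol_message(self, sender, sender_ckpt_interval):
        """Absorb the checkpoint interval piggybacked on a normal message."""
        prev = self.known_ckpt.get(sender, -1)
        self.known_ckpt[sender] = max(prev, sender_ckpt_interval)

    def trim_log(self):
        """Drop entries already covered by the receiver's known checkpoint.

        Runs purely locally and can be invoked whenever convenient, which is
        what makes the trimming lazy. Returns the number of entries removed.
        """
        before = len(self.log)
        self.log = [
            e for e in self.log
            if e.dest_interval > self.known_ckpt.get(e.dest, -1)
        ]
        return before - len(self.log)


if __name__ == "__main__":
    p = Process(pid=0)
    p.record(dest=1, dest_interval=3, payload=b"a")
    p.record(dest=1, dest_interval=7, payload=b"b")
    p.on_protocol_message(sender=1, sender_ckpt_interval=5)
    print(p.trim_log(), [e.payload for e in p.log])
```

Because stale information only delays trimming and never removes an entry too early, a process that is temporarily disconnected simply keeps a longer log until the next regular message arrives. That is the property this sketch is meant to illustrate.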


Similar resources

Garbage Collection in Distributed Object Systems

Large-scale, distributed computer systems are less reliable than their small-scale, non-distributed counterparts. A significant proportion of the reliability problems are due to the very poor scalability of manual storage management techniques, the reclamation of unused storage in particular. The bulk of automatic garbage reclamation in both message-passing and shared memory systems, with certain...


An efficient causal logging scheme for recoverable distributed shared memory systems

This paper presents a causal logging scheme for the lazy release consistent distributed shared memory systems. Causal logging is a very attractive approach to provide the fault tolerance for the distributed systems, since it eliminates the need of stable logging. However, since inter-process dependency must causally be transferred with the normal messages, the excessive message overhead has bee...


Homeless and Home-based Lazy Release Consistency Protocols on Distributed Shared Memory

This paper describes the comparison between homeless and home-based Lazy Release Consistency (LRC) protocols which are used to implement Distributed Shared Memory (DSM) in cluster computing. We present a performance evaluation of parallel applications running on homeless and home-based LRC protocols. We compared the performance between TreadMarks, which uses homeless LRC protocol, and our home-...


The Region Trap Library: Handling Traps on Application-Defined Regions of Memory

User-level virtual memory (VM) primitives are used in many different application domains including distributed shared memory, persistent objects, garbage collection, and checkpointing. Unfortunately, VM primitives only allow traps to be handled at the granularity of fixed-sized pages defined by the operating system and architecture. In many cases, this results in a size mismatch between pages an...


Cyclic distributed garbage collection

With the continued growth of distributed systems as a means to provide shared data, designers are turning their attention to garbage collection, prompted by the complexity of memory management and the desire for transparent object management. Garbage collection in very large address spaces is a difficult and unsolved problem, due to problems of efficiency, fault-tolerance, scalability and completen...



Journal:
  • IEEE Trans. Parallel Distrib. Syst.

Volume 13

Publication date: 2002